All Questions
Tagged with scikit-learn nlp
72 questions
0 votes
0 answers
39 views
Keep training a PyTorch model on new data
I'm working on a text classification task and have decided to use a PyTorch model for this purpose. The process mainly involves the following steps: Load and process the text. Use a TF-IDF Vectorizer....
1 vote
1 answer
907 views
Llama model without using the Hugging Face API
Is it possible to obtain the Llama model alone as open-source code, without using the Hugging Face API, so that it can be hosted on our server?
1 vote
1 answer
389 views
Text segmentation problem
I am new to ML and trying to solve the problem of text segmentation. I have a transcript of a news show and I want to split it into parts by topic. I tried to google and asked ChatGPT and found ...
0 votes
0 answers
129 views
On which texts should TfidfVectorizer be fitted when using TF-IDF cosine for text similarity?
I wonder which texts TfidfVectorizer should be fitted on when using TF-IDF cosine for text similarity. Should TfidfVectorizer be fitted on the texts that are analyzed for text similarity, or some ...
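To make the choice in this question concrete, here is a minimal pure-Python sketch (not scikit-learn's implementation) that computes TF-IDF cosine similarity for the same pair of texts twice: once with IDF weights fitted on a separate reference corpus, and once with IDF weights fitted only on the pair itself. The toy texts, the whitespace tokenizer, the smoothed IDF formula, and the helper names `fit_idf`, `tfidf_vector`, and `cosine` are all assumptions made for illustration.

```python
import math
from collections import Counter

def fit_idf(corpus):
    """Compute smoothed IDF weights from the corpus the weights are fitted on."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc.split()))
    # One common smoothed form: ln((1 + n) / (1 + df)) + 1.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Weight raw term counts by IDF; terms unseen during fitting are dropped."""
    counts = Counter(text.split())
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy data: a reference corpus and the pair being compared.
reference = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]
pair = ["the cat sat", "the dog sat"]

idf_reference = fit_idf(reference)  # fitted on the separate reference corpus
idf_pair = fit_idf(pair)            # fitted only on the texts being compared

sim_ref = cosine(tfidf_vector(pair[0], idf_reference),
                 tfidf_vector(pair[1], idf_reference))
sim_pair = cosine(tfidf_vector(pair[0], idf_pair),
                  tfidf_vector(pair[1], idf_pair))
print(sim_ref, sim_pair)
```

The two scores differ because the fitting corpus determines which terms count as rare, so the choice of fitting texts directly affects the similarity values.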
2 votes
2 answers
130 views
In sklearn tfidf, what is the difference between term frequency and document frequency?
Looking at the sklearn tfidf page: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html and trying to understand the difference between term frequency ...
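As an illustration of the two quantities named in this question, here is a minimal pure-Python sketch (not scikit-learn's code). Term frequency counts how often a term occurs within one document; document frequency counts how many documents contain the term at least once. The toy corpus `docs` and the whitespace tokenizer are assumptions.

```python
from collections import Counter

# Hypothetical toy corpus, tokenized by simple whitespace splitting.
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "a cat and a dog",
]
tokenized = [doc.split() for doc in docs]

# Term frequency: occurrences of each term inside a single document.
term_freq = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term
# (set() ensures a term is counted once per document).
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

print(term_freq[0]["the"])  # occurrences of "the" in the first document
print(doc_freq["the"])      # number of documents containing "the"
```

Term frequency is a per-document quantity, while document frequency is a single corpus-wide number per term.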
3 votes
4 answers
2k views
Accuracy is getting worse after text preprocessing
I'm working on a multi-class text classification project. After splitting the dataset into train and test datasets, I've applied the function below to the train dataset (i.e., preprocessing): ...
3 votes
1 answer
574 views
Is there a way to map words to their synonyms in tfidf?
I have the following code: ...
1 vote
1 answer
197 views
Why is max_features ordered by term frequency instead of inverse document frequency?
In the docs: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html it is explained that max_features is ordered by ...
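The scikit-learn documentation describes max_features as keeping the top terms ordered by term frequency across the corpus. A minimal pure-Python sketch of that selection rule follows. It is an illustration on assumed toy data, not the library's code; `top_terms_by_corpus_frequency` is a hypothetical helper name, and the alphabetical tie-breaking is an assumption made only to keep the output deterministic.

```python
from collections import Counter

def top_terms_by_corpus_frequency(docs, max_features):
    """Return the max_features terms with the highest total count across docs."""
    totals = Counter()
    for doc in docs:
        totals.update(doc.split())
    # Sort by descending total count, then alphabetically (assumed tie-break).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:max_features]]

# Hypothetical toy corpus.
docs = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]
vocab = top_terms_by_corpus_frequency(docs, max_features=3)
print(vocab)
```

In this toy corpus the most frequent term is also one that appears in most documents, so it would receive a low IDF weight, which is the tension the question raises.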
0 votes
1 answer
233 views
LinearSVC training time with CountVectorizer and HashingVectorizer
I am currently trying to build a text classifier and I am experimenting with different settings. Specifically, I am extracting my features with a CountVectorizer ...
0 votes
1 answer
59 views
Optimal clusters for K-means not clear - any ideas?
I have a toy dataset of 10,000 strings of people's names, addresses and birthdays. As a quirk of the data collection process it is highly likely there are duplicate people caused by typos and I am ...
1 vote
0 answers
18 views
What approaches can be used to merge (ensemble) a non-probabilistic model with a RandomForest?
I have an RF for text classification and it gives me accuracy. Almost the same metric is given by another model built using ...
4 votes
1 answer
1k views
How to perform entity level train-val-test split for NER task?
sklearn provides normal and stratified split options that can be used for ML problems like multi-class classification. This is relatively easy to do, as (1) one sample has one class, ...
0 votes
3 answers
156 views
Creating numeric word representation of input sentences resulting in MemoryError
I am trying to use CountVectorizer to obtain a numerical word representation of data, which is essentially a list of 160000 English sentences: ...
1 vote
1 answer
74 views
Which algorithm is best for predicting diseases if symptoms are given? [closed]
After topic modelling through LDA, I get the following dataset as a result. ...
3 votes
1 answer
1k views
How to identify/recognize that a sentence talks about the future?
Brief Introduction: I have a report/paragraph in which there are sentences with reference to future plans/outlooks/expectations for a particular entity. I want to extract all such sentences for now. ...